Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

My GitHub repository: https://github.com/kastematonen/IODS-project

My course diary: https://kastematonen.github.io/IODS-project/

Assignment 1

1. Check that you have everything installed and created according to the instructions. You should have a GitHub repository, a course diary web page (also on GitHub, in a different address) and the IODS-project started on RStudio using the course templates.

Should be all good.

2. Just write some of your thoughts about this course freely in the file, e.g., How are you feeling right now? What do you expect to learn? Where did you hear about the course?

I am feeling a bit overwhelmed by the number of files, instructions and books on the course. I am not yet sure what to look at when doing something, or what to write down where when reading the instructions and exercises.

I expect to learn a few more things about some basic statistical analyses. I came across the course when browsing the course selection in SISU.

Also reflect on your learning experiences with the R for Health Data Science book and the Exercise Set 1: How did it work as a “crash course” on modern R tools and using RStudio? Which were your favorite topics? Which topics were most difficult? Some other comments on the book and our new approach of getting started with R Markdown etc.? (All this is just “warmup” to get well started and learn also the technical steps needed each week in Moodle, that is, submit and review. We will start more serious work next week! You can already look at the next topic in Moodle and begin working with the Exercise Set 2…)

I’ve worked with Rmd and Quarto documents before, so R Markdown is at least somewhat familiar to me. The same goes for using git for version control, though I’m used to using it from the command line and have never used it from RStudio before.

As for the exercise set, I read through it quickly and the contents all seemed familiar, which was nice.

3. Open the index.Rmd file with RStudio. At the beginning of the file, in the YAML options below the ‘title’ option, add the following option: author: “Your Name”. Save the file and “knit” the document (there’s a button for that) as an HTML page. This will also update the index.html file.

Added this, my name is visible on the knitted index document.

4. To make the connection between RStudio and GitHub as smooth as possible, you should create a Personal Access Token (PAT).

Got an error from the command gitcreds::gitcreds_set() saying that it cannot find git. I have been using git on this computer before, so it should be there. I tried reinstalling the latest version (the one I had wasn’t the newest) and ran gitcreds::gitcreds_set() again, but got the same message.

Pushing local changes was not successful in the way described in the course instructions, but I did manage to do it with the push button in the Git pane in RStudio, which asked me to sign in to GitHub (which I did with the newly created access token). When pushing changes the next time, it did not ask me to sign in again, so I guess adding the PAT this way is pretty much the same as adding it the way described in the instructions.

I guess this section wasn’t a complete success, but using git still works in this ever so slightly different way, so I’d still consider this to be what was needed. I am used to using git from the command line, so I don’t think it matters where you operate it, as long as it works.

5. Upload the changes to GitHub (the version control platform) from RStudio

This worked in the way described above.

6. After a few moments, go to your GitHub repository at https://github.com/your_github_username/IODS-project to see what has changed (please be patient and refresh the page). Also visit your course diary that has been automatically updated at https://your_github_username.github.io/IODS-project and make sure you see the changes there as well.

Everything looked good: pushing my changes to GitHub worked, since I can see my additions to my course diary on the GitHub page. I need to remember to save all the documents and to knit the index file before pushing, to be sure that all the changes are actually pushed (right now the latest addition of one sentence is not visible on the page).


2: Regression and model validation

Describe the work you have done this week and summarize your learning.

date()
## [1] "Tue Dec  5 08:12:57 2023"

Prep packages:

library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## Warning: package 'forcats' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Read in data:

data <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep = ",", header = T)

Explore the data:

dim(data)
## [1] 166   7
head(data)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
str(data)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The data has 166 rows and 7 columns: gender, age, attitude, deep, stra, surf, and points.

See https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-meta.txt for the original description of the data and the script create_learning2014.R for how the data was wrangled.

Graphical overview of the data:

pairs(data[-1]) # exclude the nonnumeric column gender

# include the class variable as a factor
data_factors <- data
data_factors$gender <- as.factor(data_factors$gender)
pairs(data_factors)

ggpairs(data_factors, mapping = aes(alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

Some variables are correlated more, some less. For example, older study subjects seem to have higher points, and attitude and points are positively correlated.

Summaries of the variables in the data:

summary(data)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Apart from gender, age and points, the variables (attitude, deep, stra and surf) lie on a scale from 1 to 5. Points range from 7 to 33.

Linear regression:

# fit a linear model
my_model <- lm(points ~ attitude + stra + surf, data = data) 
# the three variables with the highest correlation to points in the plots above

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The variables stra and surf do not have a statistically significant association with the outcome variable points, so we remove them from the model:

# fit a linear model
my_model <- lm(points ~ attitude, data = data) 

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

In this model, a one-unit increase in attitude is associated with an average increase of 3.53 in points. In the previous model, with the other variables held fixed, stra had a small positive coefficient (0.85) and surf a negative one (-0.59), but neither was statistically significant.

The estimate or beta is the slope of the linear regression line, and the intercept is where the line hits the y-axis.

The statistical test related to each model parameter tests the null hypothesis that the slope is equal to zero. In the first model the p-values for stra and surf were > 0.05, so we cannot reject the hypothesis that their slopes are zero, whereas the slope of attitude differs from zero in a statistically significant way (p < 0.001).
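
As a small worked example of this test, the t value and p-value for attitude can be recomputed from the estimate and standard error printed in the summary above. This is only a sketch: the three numbers are copied by hand from that output, and the variable names are made up for illustration.

```r
# Values copied from the summary(my_model) output above
estimate  <- 3.5255   # slope for attitude
std_error <- 0.5674   # its standard error
df_resid  <- 164      # residual degrees of freedom

# t statistic for the null hypothesis "slope = 0"
t_value <- estimate / std_error

# two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

t_value
p_value
```

The results should agree with the t value and Pr(>|t|) columns of the summary up to rounding.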

R squared is the proportion of the variance of the dependent variable that is explained by the explanatory variables. It lies between 0 and 1, and a value of 1 would mean that the model fits the data perfectly. The adjusted R squared additionally penalizes the number of explanatory variables. Our model has a low R squared (0.19), so attitude alone explains only about 19% of the variation in points.
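
To make the definition concrete, here is a minimal sketch that computes R squared and adjusted R squared by hand and compares them with what summary() reports. It uses simulated data (an assumption, so that the example runs on its own), not the learning2014 data.

```r
# Simulated data, only for illustration
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)    # residual sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares
n <- length(y)                  # number of observations
p <- 1                          # number of explanatory variables

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)
```

The adjusted value is never larger than the plain R squared, which is why it is the fairer number when comparing models with different numbers of explanatory variables.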

Diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage

par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))

The assumptions of the model are:

  1. Linear relationship between predictors and outcome;

    • From the pairwise plots we drew in the beginning we can tell that the linear relationship assumption is met.
    • This is also visible in the “residuals vs fitted values” plot, where no curvature is seen.
  2. Independence of residuals;

  3. Normal distribution of residuals;

    • In the QQ plot the values roughly follow the straight line, so the residuals are roughly normally distributed. The ends are a bit off and there are outliers in the left tail, which should be addressed to be sure that the distribution is actually normal. The leverage plot shows outliers too, so these should likely be addressed.
  4. Equal variance of residuals.

    • In the “residuals vs fitted values” plot there is no area where the residuals are much smaller or larger than elsewhere, so the variance of the residuals appears to be roughly constant.

3: Logistic regression

date()
## [1] "Tue Dec  5 08:13:57 2023"

Prep packages:

library(GGally)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(boot)

Read in data:

The data is from https://www.archive.ics.uci.edu/dataset/320/student+performance

data <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep = ",", header = T)
colnames(data)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

We want to study the relationships between high/low alcohol consumption and some of the other variables in the data.

Choosing three interesting variables in the data and coming up with hypotheses for them:

  • “sex”: I think male students might have a higher alcohol consumption

  • “failures”: more failed classes might result in a higher alcohol consumption

  • “goout”: more going out with friends might result in a higher alcohol consumption

Numerically and graphically exploring the distributions of your chosen variables and their relationships with alcohol consumption:

# for sex:
data %>% group_by(sex, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 3
## # Groups:   sex [2]
##   sex   high_use count
##   <chr> <lgl>    <int>
## 1 F     FALSE      154
## 2 F     TRUE        41
## 3 M     FALSE      105
## 4 M     TRUE        70
# in females, 41/(41+154) = 0.2102564 have high use
# in males, 70/(70+105) = 0.4 have high use

g1 <- ggplot(data, aes(x = high_use, y = sex)) + geom_boxplot() + ylab("sex")
g1 # not a great way to plot this

g2 <- ggplot(data = data, aes(x = high_use)) + geom_bar() + facet_wrap("sex")
g2 # this is better

#-------------------------------------------------------------------------------------------------
# for failures:
data %>% group_by(high_use) %>% summarise(count = n(), mean_failure = mean(failures))
## # A tibble: 2 × 3
##   high_use count mean_failure
##   <lgl>    <int>        <dbl>
## 1 FALSE      259        0.120
## 2 TRUE       111        0.351
# more failures in the group with high use

g1 <- ggplot(data, aes(x = high_use, y = failures)) + geom_boxplot() + ylab("failures")
g1 # not a great way to plot this

g2 <- ggplot(data = data, aes(x = high_use)) + geom_bar() + facet_wrap("failures")
g2 # this is better

#-------------------------------------------------------------------------------------------------
# for goout:
data %>% group_by(high_use) %>% summarise(count = n(), mean_going_out = mean(goout))
## # A tibble: 2 × 3
##   high_use count mean_going_out
##   <lgl>    <int>          <dbl>
## 1 FALSE      259           2.85
## 2 TRUE       111           3.73
# higher mean, i.e. more going out, in the group with high use

g1 <- ggplot(data, aes(x = high_use, y = goout)) + geom_boxplot() + ylab("going out")
g1 # looks nice

g2 <- ggplot(data = data, aes(x = high_use)) + geom_bar() + facet_wrap("goout")
g2 # a plot for each going out level from 1 (the least) to 5 (the most)

A clear majority of the female students are in the low consumption group (only 21% have high use), as expected. Among male students the division is more even, with a larger proportion (40%) in the high use group.

When there are 0 failures, most are in the low consumption class, whereas in the other groups the division is more even, and in the class with the most failures, all have high consumption. This is in accordance with the hypothesis.

In the high consumption class the individuals are more outgoing, as hypothesized.

Logistic regression on the variables chosen:

m <- glm(high_use ~ sex + failures + goout, data = data, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + failures + goout, family = "binomial", 
##     data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.7973     0.4565  -8.318  < 2e-16 ***
## sexM          0.8856     0.2531   3.500 0.000466 ***
## failures      0.5292     0.2288   2.313 0.020732 *  
## goout         0.7275     0.1195   6.089 1.13e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 383.39  on 366  degrees of freedom
## AIC: 391.39
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)        sexM    failures       goout 
##  -3.7973068   0.8855768   0.5292036   0.7274660
# ORs and their CIs

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR       2.5 %    97.5 %
## (Intercept) 0.0224311 0.008824332 0.0530417
## sexM        2.4243823 1.483028327 4.0075122
## failures    1.6975798 1.092049103 2.6922782
## goout       2.0698290 1.647312988 2.6341765

All p-values are < 0.05.

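The model includes a categorical variable sex, for which the p-value indicates that the level M differs from the reference level F in a statistically significant way. The OR for sexM (2.42) is the ratio of the odds of high alcohol consumption for male students compared to female students when the other variables are kept constant, so being male is a risk factor for high alcohol consumption. The exponentiated intercept (0.022) is not an odds ratio but the baseline odds of high use for a female student with failures = 0 and goout = 0, and it should not be added to the other ORs.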

The other ORs are also above 1, and their confidence intervals do not include 1, so these variables are risk factors as well. Each OR is the factor by which the odds of the outcome are multiplied for a one-unit increase in the variable when the other variables are kept constant. For example, a one-unit increase in failures multiplies the odds of high alcohol consumption by about 1.70, and a one-unit increase in goout by about 2.07.
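
The link between the coefficients, the odds ratios and the predicted probabilities can be sketched as follows. The coefficients are copied by hand from coef(m) above; the two example students (goout = 3, no failures) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the coef(m) output above
b <- c(intercept = -3.7973068, sexM = 0.8855768,
       failures = 0.5292036, goout = 0.7274660)

# Odds ratios are the exponentiated coefficients
odds_ratios <- exp(b[c("sexM", "failures", "goout")])
odds_ratios

# Linear predictor (log-odds) for two hypothetical students:
# no failures, goout = 3, differing only in sex
eta_male   <- b[["intercept"]] + b[["sexM"]] * 1 + b[["failures"]] * 0 + b[["goout"]] * 3
eta_female <- b[["intercept"]] + b[["sexM"]] * 0 + b[["failures"]] * 0 + b[["goout"]] * 3

# plogis() is the inverse logit: it turns log-odds into probabilities
p_male   <- plogis(eta_male)
p_female <- plogis(eta_female)

# The ratio of the two odds equals the OR for sexM
odds <- function(p) p / (1 - p)
odds(p_male) / odds(p_female)
```

Note that the OR describes a ratio of odds, not of probabilities: the predicted probabilities for the two students differ by less than a factor of 2.42.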

To know whether the variable sex as a whole improves the model fit, we could fit a model without it and compare the two models with a likelihood ratio test (anova() with test = "Chisq").
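
A minimal sketch of such a comparison is shown here. It runs on simulated data (an assumption, so that the example is self-contained); the data frame sim and its effect sizes are made up and do not come from the alc data.

```r
# Simulated stand-in data, only for illustration
set.seed(42)
n <- 300
sim <- data.frame(
  sex   = factor(sample(c("F", "M"), n, replace = TRUE)),
  goout = sample(1:5, n, replace = TRUE)
)
eta <- -3 + 0.9 * (sim$sex == "M") + 0.7 * sim$goout
sim$high_use <- rbinom(n, size = 1, prob = plogis(eta))

# nested models: with and without sex
m_full    <- glm(high_use ~ sex + goout, data = sim, family = "binomial")
m_reduced <- glm(high_use ~ goout, data = sim, family = "binomial")

# likelihood ratio test: does adding sex reduce the deviance significantly?
lrt <- anova(m_reduced, m_full, test = "Chisq")
lrt
```

With the real data the same call would be made on the model m and a copy of it fitted without sex.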

Explore the predictive power of the model:

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

data <- mutate(data, probability = probabilities)
data <- mutate(data, prediction = probability > 0.5) # probability > 0.5 means the predicted class is high consumption

# tabulate the target variable versus the predictions
table(high_use = data$high_use, prediction = data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   250    9
##    TRUE     77   34
# or like this
table(high_use = data$high_use, prediction = data$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67567568 0.02432432 0.70000000
##    TRUE  0.20810811 0.09189189 0.30000000
##    Sum   0.88378378 0.11621622 1.00000000
# a plot of the results
g <- ggplot(data, aes(x = probability, y = high_use, col = prediction)) + geom_point()
g

# training error
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = data$high_use, prob = data$probability)
## [1] 0.2324324
# Compare the performance of the model with performance achieved by some simple guessing strategy
# a literal guesser, having 0.5 chance at choosing the correct value
guesses <- sample(c(0,1), nrow(data), replace = TRUE)
loss_func(class = data$high_use, prob = guesses)
## [1] 0.5297297
# as expected, the guesser is right about half of the time, not being very efficient at predicting

Tabulating the predictions, in total 250 + 34 = 284 have been predicted right, while 77 + 9 = 86 have been misclassified. This results in a training error of 23%, which is clearly better than that of a random guesser (53% in the run shown above, about 50% on average). Thus, our model outperforms guessing the class randomly by quite a bit.
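
The training error can also be recomputed directly from the cross-tabulation. The sketch below copies the four counts by hand from the table above and, as an extra comparison that is not part of the original analysis, adds a second simple strategy: always predicting the majority class (low use).

```r
# Counts copied from the table above (rows = observed, columns = predicted)
conf_mat <- matrix(c(250, 77, 9, 34), nrow = 2,
                   dimnames = list(high_use   = c("FALSE", "TRUE"),
                                   prediction = c("FALSE", "TRUE")))
n_total <- sum(conf_mat)

# misclassified = off-diagonal cells
training_error <- (conf_mat["FALSE", "TRUE"] + conf_mat["TRUE", "FALSE"]) / n_total

# baseline: always predict FALSE, so every true high user is wrong
baseline_error <- sum(conf_mat["TRUE", ]) / n_total

c(training_error = training_error, baseline_error = baseline_error)
```

The majority-class baseline is a stricter yardstick than coin flipping, because it already uses the fact that most students are in the low use group.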

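Cross-validation (the code below uses K = nrow(data), which is leave-one-out cross-validation; K = 10 would give 10-fold cross-validation):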

cv <- cv.glm(data = data, cost = loss_func, glmfit = m, K = nrow(data))
cv$delta[1]
## [1] 0.2567568

The test error is 26%, which is higher than the training error, as one might expect. It is about the same as that of the model in the exercise set (26%).
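
To show what the K argument changes, here is a hand-written sketch of K-fold cross-validation for a logistic regression. It is not how cv.glm is implemented internally, and it runs on simulated data (an assumption); kfold_error is a made-up helper name.

```r
# Simulated data, only for illustration
set.seed(2023)
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x))

kfold_error <- function(data, K) {
  # assign each row to one of K folds at random
  folds <- sample(rep(seq_len(K), length.out = nrow(data)))
  wrong <- logical(nrow(data))
  for (k in seq_len(K)) {
    held_out <- folds == k
    # fit on everything except fold k
    fit <- glm(y ~ x, data = data[!held_out, , drop = FALSE], family = "binomial")
    # predict the held-out rows and mark the misclassified ones
    prob <- predict(fit, newdata = data[held_out, , drop = FALSE], type = "response")
    wrong[held_out] <- abs(data$y[held_out] - prob) > 0.5
  }
  mean(wrong)
}

err_10fold <- kfold_error(sim, K = 10)         # 10-fold
err_loo    <- kfold_error(sim, K = nrow(sim))  # leave-one-out
c(err_10fold = err_10fold, err_loo = err_loo)
```

Leave-one-out gives the same answer on every run but fits the model once per observation, while 10-fold is cheaper and depends on the random split.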

Try finding a model with a smaller test error:

m <- glm(high_use ~ sex + failures + goout + age, data = data, family = "binomial")
cv <- cv.glm(data = data, cost = loss_func, glmfit = m, K = nrow(data))
cv$delta[1]
## [1] 0.2054054

The test error is 21%, which is now smaller than in the exercise set.


4: Clustering and classification

date()
## [1] "Tue Dec  5 08:14:40 2023"

Prep packages:

library(dplyr)
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(GGally)

Load the Boston data from the R package MASS and explore the structure and the dimensions of the data:

The dataset contains “Housing Values in Suburbs of Boston”.

More information on the data can be found here:

[https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html]

data("Boston")

A graphical overview of the data and summaries of the variables in the data:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
pairs(Boston)

cor_matrix <- cor(Boston) %>% round(2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

# ggpairs(Boston, mapping = aes(alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

The distributions of the variables are different. Some, like zn and age, range from zero to a hundred, some (like chas) from zero to one, and some have completely different scales: tax from 187 to 711, black from 0.32 to 397.

Many of the variables are heavily correlated. For example, lstat and medv have a strong negative correlation (about -0.74), and rad and tax a strong positive correlation (about 0.91).

Standardizing the dataset:

# scale the variables (all numeric)
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Standardizing brings variables measured on very different scales onto a common scale. The variables have been centered and scaled: in the standardized data the mean of each column is zero and the standard deviation is one.
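
What scale() does can be checked by hand: subtract the column mean and divide by the column standard deviation. The sketch below uses a small simulated matrix (an assumption, to keep it self-contained) with two columns on very different scales.

```r
# Simulated matrix with two very differently scaled columns
set.seed(1)
m <- cbind(a = rnorm(50, mean = 10, sd = 3),
           b = runif(50, min = 0, max = 700))

scaled_auto <- scale(m)

# the same by hand: (x - mean(x)) / sd(x), column by column
centered      <- sweep(m, 2, colMeans(m), "-")
scaled_manual <- sweep(centered, 2, apply(m, 2, sd), "/")

colMeans(scaled_auto)      # practically zero
apply(scaled_auto, 2, sd)  # one
```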

Create a categorical variable of the crime rate in the Boston dataset:

boston_scaled <- as.data.frame(boston_scaled) # we need this too
boston_scaled$crim <- as.numeric(boston_scaled$crim) # we need this too

crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

Drop the old crime rate variable from the dataset:

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
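
The quantile-based binning used above can be illustrated on a small simulated vector (an assumption; rate is a made-up stand-in for the crime rate). Using the quartiles as break points puts roughly a quarter of the observations in each class.

```r
# Simulated skewed variable, only for illustration
set.seed(7)
rate <- rexp(100)

# quantile() returns the 0%, 25%, 50%, 75% and 100% points by default
breaks <- quantile(rate)

# include.lowest = TRUE keeps the minimum value inside the first class
rate_class <- cut(rate, breaks = breaks, include.lowest = TRUE,
                  labels = c("low", "med_low", "med_high", "high"))
table(rate_class)
```

Without include.lowest = TRUE the smallest observation would fall outside the first interval and become NA.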

Divide the dataset to train and test sets:

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

Fitting the linear discriminant analysis on the train set & drawing the LDA (bi)plot:

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2549505 0.2524752 0.2425743 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       0.99350686 -0.9396143 -0.07742312 -0.8965071  0.51524373 -0.9169303
## med_low  -0.07553642 -0.3097554  0.03346513 -0.5933350 -0.08838442 -0.3690900
## med_high -0.38467135  0.1703288  0.22945822  0.3780274  0.10631529  0.3970349
## high     -0.48724019  1.0171960 -0.11163110  1.0351243 -0.44501505  0.8190637
##                 dis        rad        tax     ptratio      black       lstat
## low       0.9189849 -0.6964601 -0.7406749 -0.47383669  0.3771994 -0.79697970
## med_low   0.3468862 -0.5458997 -0.4753888 -0.01847604  0.3385514 -0.17811052
## med_high -0.3663826 -0.4076377 -0.2972633 -0.28785050  0.1197358  0.03960463
## high     -0.8525200  1.6373367  1.5134896  0.77985517 -0.6153849  0.87234759
##                 medv
## low       0.59057178
## med_low   0.03601909
## med_high  0.18958840
## high     -0.66337987
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.089933844  0.636869798 -0.80498965
## indus   -0.012200952 -0.287967678  0.30619916
## chas    -0.096305193 -0.031587956  0.11556446
## nox      0.436658498 -0.744824436 -1.43774583
## rm      -0.104599011 -0.075037943 -0.18683753
## age      0.225156229 -0.280880219 -0.14290293
## dis     -0.097483043 -0.202398225 -0.09866470
## rad      3.174486329  0.882270376 -0.02882231
## tax      0.001242167  0.097669101  0.42836035
## ptratio  0.139254132  0.003680679 -0.19412399
## black   -0.159511821  0.023871116  0.15842635
## lstat    0.212494529 -0.203718932  0.25134453
## medv     0.210477092 -0.363740319 -0.28079055
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9506 0.0370 0.0124
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

Saving the crime categories from the test set and then removing the categorical crime variable from the test dataset:

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Predicting the classes with the LDA model on the test data:

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13      12        1    0
##   med_low    1      17        5    0
##   med_high   0       8       15    1
##   high       0       0        0   29

Most observations are predicted correctly: (13 + 17 + 15 + 29) / 102 = 0.7254902, so about 72.5% of the test set is classified correctly (the exact counts change between runs because the train/test split is random). Many are still predicted wrong, especially in the low and med_high categories. The model works best for the high category, where all 29 observations are correct.
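The overall accuracy can be read directly off the cross-tabulation. A minimal sketch, re-typing the counts printed above into a matrix by hand (the object names `conf_mat`, `accuracy` and `recall` are made up for illustration):

```r
# counts copied by hand from the cross-tabulation above
# (rows = correct class, columns = predicted class)
lv <- c("low", "med_low", "med_high", "high")
conf_mat <- matrix(c(13, 12,  1,  0,
                      1, 17,  5,  0,
                      0,  8, 15,  1,
                      0,  0,  0, 29),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(correct = lv, predicted = lv))

# overall accuracy: correctly classified (diagonal) over all test observations
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# per-class recall: share of each true class that was predicted correctly
recall <- diag(conf_mat) / rowSums(conf_mat)

round(accuracy, 3)
round(recall, 3)
```

With the live objects, the result of `table(correct = correct_classes, predicted = lda.pred$class)` could be used in place of the hand-typed matrix.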

Reloading the Boston dataset and standardizing the dataset:

# from above
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled) # we need this too

Calculating the distances between the observations:

# with euclidean distance

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Running k-means algorithm:

# k-means clustering
km <- kmeans(boston_scaled, centers = 3) # trying with 3 clusters to begin with

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

Investigating the optimal number of clusters and running the algorithm again:

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
# qplot() is deprecated since ggplot2 3.4.0, so the same line plot is drawn with ggplot()
ggplot2::ggplot(data.frame(k = 1:k_max, twcss = twcss), ggplot2::aes(x = k, y = twcss)) + ggplot2::geom_line()

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

pairs(boston_scaled[1:5], col = km$cluster)

pairs(boston_scaled[6:10], col = km$cluster)

pairs(boston_scaled[11:13], col = km$cluster)

A good number of clusters is one where the total within-cluster sum of squares (TWCSS) drops sharply. Deciding on this is quite subjective: one could choose two clusters, because the drop is steepest up to that point, or perhaps six, because the descent levels off there. We are going with two clusters here.
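Because k-means starts from random centers, the elbow curve can look a little different from run to run. The sketch below is one way to make the choice less eyeball-dependent: it repeats each fit from several random starts and tabulates the relative drop in TWCSS. The value `nstart = 20` and the name `rel_drop` are assumptions made for this illustration, not part of the exercise.

```r
library(MASS)  # provides the Boston data

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)
k_max <- 10

# total within-cluster sum of squares for k = 1..k_max;
# nstart = 20 (an assumed value) keeps the best of 20 random starts for each k
twcss <- sapply(1:k_max, function(k) {
  kmeans(boston_scaled, centers = k, nstart = 20)$tot.withinss
})

# relative drop in TWCSS when going from k - 1 to k clusters
rel_drop <- -diff(twcss) / twcss[-k_max]
data.frame(k = 2:k_max,
           twcss = round(twcss[-1], 1),
           rel_drop = round(rel_drop, 3))
```

A large `rel_drop` followed by small ones points at the elbow, but the decision stays a judgment call.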

The variables separating the two groups in the pairs plots are, for instance, crim&zn and crim&nox.

Bonus section

Performing k-means on the original (standardized) Boston data:

# like above:
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled) # we need this too

# k-means clustering
km <- kmeans(boston_scaled, centers = 3) # trying with 3 clusters 

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

Performing LDA using the clusters as target classes:

# not doing a train and test split here for it was not asked for
boston_scaled$cluster <- km$cluster

# linear discriminant analysis
# lda.fit <- lda(cluster ~ ., data = boston_scaled)
# Error in lda.default(x, grouping, ...) :
# variable 4 appears to be constant within groups
# -> run the model with all variables except for the fourth one
lda.fit <- lda(cluster ~ ., data = boston_scaled[,-4])

# print the lda.fit object
lda.fit
## Call:
## lda(cluster ~ ., data = boston_scaled[, -4])
## 
## Prior probabilities of groups:
##          1          2          3 
## 0.06916996 0.61067194 0.32015810 
## 
## Group means:
##         crim         zn      indus        nox         rm        age        dis
## 1 -0.2048299 -0.1564737  0.2306535  0.3342374  0.3344149  0.3170678 -0.3634565
## 2 -0.3882449  0.2731699 -0.6264383 -0.5823006  0.2188304 -0.4585819  0.4807157
## 3  0.7847946 -0.4872402  1.1450405  1.0384727 -0.4896488  0.8062002 -0.8383961
##           rad        tax    ptratio      black      lstat       medv
## 1 -0.02700292 -0.1304164 -0.4453253  0.1787986 -0.1976385  0.6422884
## 2 -0.58641200 -0.6161585 -0.2814183  0.3151747 -0.4640135  0.3182241
## 3  1.12436056  1.2034416  0.6329916 -0.6397959  0.9277624 -0.7457491
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim     0.02479166 -0.13204141
## zn       0.42787622 -0.04638198
## indus    1.15646011  0.64348753
## nox      0.47943272  0.42987408
## rm       0.13637610 -0.12823804
## age     -0.06654278  0.30029385
## dis     -0.01915297  0.20848367
## rad      0.74637979  0.69845574
## tax      0.27967651 -1.02040695
## ptratio  0.19355485 -0.21964359
## black   -0.04753224  0.11581547
## lstat    0.47213016  0.02172623
## medv     0.06797263  0.92531426
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9822 0.0178

Visualizing the results with a biplot:

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(boston_scaled$cluster)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The variables with the largest influence on the LDs are (only the top 2 listed here): indus and rad for LD1, and tax and medv for LD2. The separation is not as good or clear as in some other examples encountered in the exercises. Classes 1 and 2 seem to go together and class 3 is more separate, but there is overlap between them all.
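The "top 2" variables can be picked programmatically by ranking the absolute coefficients. A small sketch, with the coefficients re-typed (rounded to three decimals) from the printout above; `top_vars` is a hypothetical helper written for this illustration, not a function from MASS:

```r
# coefficients copied (rounded to 3 decimals) from the lda.fit printout above
vars <- c("crim", "zn", "indus", "nox", "rm", "age", "dis",
          "rad", "tax", "ptratio", "black", "lstat", "medv")
scaling <- cbind(
  LD1 = c(0.025, 0.428, 1.156, 0.479, 0.136, -0.067, -0.019,
          0.746, 0.280, 0.194, -0.048, 0.472, 0.068),
  LD2 = c(-0.132, -0.046, 0.643, 0.430, -0.128, 0.300, 0.208,
          0.698, -1.020, -0.220, 0.116, 0.022, 0.925)
)
rownames(scaling) <- vars

# hypothetical helper: names of the n variables with the largest absolute coefficient
top_vars <- function(coefs, n = 2) {
  names(sort(abs(coefs), decreasing = TRUE))[seq_len(n)]
}

# one column per discriminant, one row per rank
top2 <- apply(scaling, 2, top_vars)
top2
```

With the fitted model at hand, `lda.fit$scaling` could replace the hand-typed matrix. Comparing coefficients like this is only reasonable here because the predictors were standardized first.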

Super bonus section:

# scaled train data from above
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled) # we need this too
boston_scaled$crim <- as.numeric(boston_scaled$crim) # we need this too
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
# lda.fit
# contains LD1-LD3

# k-means clustering is also needed
# km <- kmeans(train, centers = 2) # 2 clusters here to match the exercise above where we chose 2 as the optimal number
# Warning: NAs introduced by coercion
# Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
# we need to make crime a numeric column (it is a factor now)
train$crime <- as.numeric(train$crime)
km <- kmeans(train, centers = 2) # ok
# the example script copied here
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.

# the original plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# modifying the plot: Set the color to be the crime classes of the train set
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# modifying the plot: color is defined by the clusters of the k-means
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)

The plots have some differences and similarities. In all of them there is some overlap between the groups, and using only 2 clusters in k-means still doesn't separate the groups that well, even though the 3D plot shows two clear groups it could pick up on. In both coloured plots the tighter cluster has mainly one colour (group/class label), whereas the sparser cluster has more.
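The visual comparison between crime classes and k-means clusters can be backed up with a cross-tabulation. The sketch below is a simplification of the setup above: it uses the full scaled Boston data instead of the 80% training split, clusters on the predictors only, and uses `nstart = 20`, all of which are assumptions made for this illustration.

```r
library(MASS)  # provides the Boston data

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# crime classes from the quartiles of the scaled crime rate, as in the text above
crime <- cut(boston_scaled$crim,
             breaks = quantile(boston_scaled$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# k-means on the predictors only (crim dropped); nstart = 20 is an assumed setting
predictors <- boston_scaled[, names(boston_scaled) != "crim"]
set.seed(123)
km <- kmeans(predictors, centers = 2, nstart = 20)

# rows: crime class, columns: k-means cluster
agreement <- table(crime = crime, cluster = km$cluster)
agreement

# share of each crime class that falls into each cluster
round(prop.table(agreement, margin = 1), 2)
```

If one cluster holds almost all of a crime class, the two colourings of the 3D plot tell a similar story for that class.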


5: Dimensionality reduction techniques

date()
## [1] "Tue Dec  5 08:16:23 2023"

Prep packages:

library(dplyr)
library(readr)
library(corrplot)
library(GGally)
library(tibble)
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.3.2

Read in data:

human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Move the country names to rownames:

human_ <- column_to_rownames(human, "Country")

A graphical overview of the data and summaries of the variables:

# visualize the 'human_' variables
ggpairs(human_, progress = FALSE)

# compute the correlation matrix and visualize it with corrplot
cor(human_)
##                Edu2.FM      Labo.FM   Life.Exp     Edu.Exp         GNI
## Edu2.FM    1.000000000  0.009564039  0.5760299  0.59325156  0.43030485
## Labo.FM    0.009564039  1.000000000 -0.1400125  0.04732183 -0.02173971
## Life.Exp   0.576029853 -0.140012504  1.0000000  0.78943917  0.62666411
## Edu.Exp    0.593251562  0.047321827  0.7894392  1.00000000  0.62433940
## GNI        0.430304846 -0.021739705  0.6266641  0.62433940  1.00000000
## Mat.Mor   -0.660931770  0.240461075 -0.8571684 -0.73570257 -0.49516234
## Ado.Birth -0.529418415  0.120158862 -0.7291774 -0.70356489 -0.55656208
## Parli.F    0.078635285  0.250232608  0.1700863  0.20608156  0.08920818
##              Mat.Mor  Ado.Birth     Parli.F
## Edu2.FM   -0.6609318 -0.5294184  0.07863528
## Labo.FM    0.2404611  0.1201589  0.25023261
## Life.Exp  -0.8571684 -0.7291774  0.17008631
## Edu.Exp   -0.7357026 -0.7035649  0.20608156
## GNI       -0.4951623 -0.5565621  0.08920818
## Mat.Mor    1.0000000  0.7586615 -0.08944000
## Ado.Birth  0.7586615  1.0000000 -0.07087810
## Parli.F   -0.0894400 -0.0708781  1.00000000
# this is copied straight from the exercises, i guess the visualization with corrplot is missing from there??
# try this here, i think we did something similar a week or two ago
corrplot(cor(human_))

ggpairs gives an overview of the variables: how they are distributed and how they correlate with one another. Many variables are strongly correlated, either positively or negatively, and this is visible in all three views above: the ggpairs plot, the correlation matrix, and the corrplot. For example, Life.Exp and Edu2.FM are positively correlated (0.58), and Mat.Mor and Life.Exp strongly negatively (-0.86).

The variables have different distributions: some are heavily concentrated at one edge, such as GNI and Mat.Mor, and some are more bell-shaped, like Edu.Exp.

PCA on non-standardized data:

pca_human <- prcomp(human_) # using the data where countries are rownames

# variability captured by the PCs
# create and print out a summary of pca_human
s <- summary(pca_human)
# rounded proportions of variance captured by each PC (multiply by 100 for percentages)
pca_pr <- round(1*s$importance[2, ], digits = 5)
# print out the proportions of variance
pca_pr
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 0.9999 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
# biplot
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

# another kind of a biplot made in the exercises
# create object pc_lab to be used as axis labels
paste0(names(pca_pr), " (", pca_pr, "%)")
## [1] "PC1 (0.9999%)" "PC2 (1e-04%)"  "PC3 (0%)"      "PC4 (0%)"     
## [5] "PC5 (0%)"      "PC6 (0%)"      "PC7 (0%)"      "PC8 (0%)"
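# copied straight from the exercises; the labels are not saved into a variable, so they are never used as axis labels
# note: pca_pr holds proportions here, not percentages, so the "%" values above are 100 times too small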
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = NA, ylab = NA)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

We can see that PC1 captures 99.99% of the variance, leaving almost nothing for the remaining components. In the biplot the highly skewed variable GNI (see the distribution plots a bit earlier) dominates PC1. Since the data is not scaled, we can assume that GNI has the largest numerical range, and hence the largest variance, of the variables in the data, which is why it drives the outcome of the PCA.
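The axis labels printed above show values such as "0.9999%" because the proportions were never multiplied by 100. A minimal sketch of labels in actual percentages, using the proportions copied from the unscaled PCA output above (`pca_prop` is a name made up for this example):

```r
# proportions of variance copied from the unscaled PCA output above
pca_prop <- c(PC1 = 0.9999, PC2 = 0.0001, PC3 = 0, PC4 = 0,
              PC5 = 0, PC6 = 0, PC7 = 0, PC8 = 0)

# convert to percentages and build the axis labels
pc_lab <- sprintf("%s (%.2f%%)", names(pca_prop), 100 * pca_prop)
pc_lab[1:2]
```

The first two labels could then be passed to the biplot as `xlab = pc_lab[1]` and `ylab = pc_lab[2]` instead of `NA`.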

PCA on scaled data:

human_std <- scale(human_) # using the data where countries are rownames

pca_human_std <- prcomp(human_std)

# variability captured by the PCs
s_std <- summary(pca_human_std)
pca_pr <- round(1*s_std$importance[2, ], digits = 5)
pca_pr
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.53605 0.16237 0.09571 0.07583 0.05477 0.03595 0.02634 0.01298
# biplot
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

Now that the variables are scaled, the first PC captures 54% of the variance, and the shares of the remaining components are much larger than before. This is also evident from the biplot, where the arrows of the different variables are of comparable length.

The results differ depending on whether the data is scaled. PCA is not a clustering method but, like distance-based clustering, it is sensitive to the scale of the variables: it looks for the directions of largest variance, so if the variables are on totally different units or numerical ranges, those with the largest values (and thus the largest variances) dominate the analysis. This is what we saw happening with the unscaled data.
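The effect of scaling can be reproduced on a toy example. The sketch below uses simulated data (not the human data): two independent variables whose standard deviations differ by a factor of 1000. All names are made up for the illustration.

```r
set.seed(1)

# two independent simulated variables on very different scales (made-up data)
toy <- data.frame(small = rnorm(100, sd = 1),
                  big   = rnorm(100, sd = 1000))

pca_raw <- prcomp(toy)                 # unscaled
pca_std <- prcomp(toy, scale. = TRUE)  # standardized inside prcomp

# proportion of variance captured by each PC
prop_raw <- summary(pca_raw)$importance[2, ]
prop_std <- summary(pca_std)$importance[2, ]
rbind(unscaled = prop_raw, scaled = prop_std)

# loadings of PC1: the unscaled version is almost entirely the "big" variable
round(pca_raw$rotation[, "PC1"], 3)
```

Without scaling, PC1 is essentially the `big` variable on its own, much like GNI in the human data. With scaling, the variance is split far more evenly between the two components.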

Include captions in the biplots:

The biplot function used here is the one from the base R stats package, and according to its help page it has no caption argument. Some other packages have a biplot function that does, for example PCAtools. If we pass caption to the base function anyway, we only get warnings:

# non-scaled data:
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), caption = "hei")
## Warning in plot.window(...): "caption" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "caption" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "caption" is not a
## graphical parameter
## Warning in box(...): "caption" is not a graphical parameter
## Warning in title(...): "caption" is not a graphical parameter
## Warning in text.default(x, xlabs, cex = cex[1L], col = col[1L], ...): "caption"
## is not a graphical parameter
## Warning in plot.window(...): "caption" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "caption" is not a graphical parameter
## Warning in title(...): "caption" is not a graphical parameter
## Warning in axis(3, col = col[2L], ...): "caption" is not a graphical parameter
## Warning in axis(4, col = col[2L], ...): "caption" is not a graphical parameter
## Warning in text.default(y, labels = ylabs, cex = cex[2L], col = col[2L], :
## "caption" is not a graphical parameter
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

# scaled data: 
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), caption = "hei")
## Warning in plot.window(...): "caption" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "caption" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "caption" is not a
## graphical parameter
## Warning in box(...): "caption" is not a graphical parameter
## Warning in title(...): "caption" is not a graphical parameter
## Warning in text.default(x, xlabs, cex = cex[1L], col = col[1L], ...): "caption"
## is not a graphical parameter
## Warning in plot.window(...): "caption" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "caption" is not a graphical parameter
## Warning in title(...): "caption" is not a graphical parameter
## Warning in axis(3, col = col[2L], ...): "caption" is not a graphical parameter
## Warning in axis(4, col = col[2L], ...): "caption" is not a graphical parameter
## Warning in text.default(y, labels = ylabs, cex = cex[2L], col = col[2L], :
## "caption" is not a graphical parameter
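Base graphics can still carry a caption-like line without a `caption` argument, for instance with `title(sub = ...)` after the plot has been drawn. A minimal sketch, using the built-in `USArrests` data as a stand-in for the human data; the caption text and the name `pca_demo` are placeholders:

```r
# USArrests is a built-in dataset, used here only as a stand-in for human_
pca_demo <- prcomp(USArrests, scale. = TRUE)

# draw the biplot as before
biplot(pca_demo, choices = 1:2, cex = c(0.6, 0.9), col = c("grey40", "deeppink2"))

# add a caption-like subtitle below the x-axis of the existing plot
title(sub = "Biplot of the first two principal components (standardized data)")
```

The same two steps should carry over to `pca_human` and `pca_human_std`; `mtext()` is an alternative if the text needs to sit in a different margin.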

We also did not add captions to the biplots in the exercises, so I am including the information here as plain text instead. I am also not entirely sure to what extent we are meant to describe the results here: captions are usually very brief, I have already touched on the scaled/unscaled issue when interpreting the results, and the next question wants me to take a closer look at the PC axes. So I am combining the two topics in a somewhat brief “caption” next.

Describing the results using the actual phenomena the variables relate to:

Personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized data:

As just stated, PC1 is defined by maternal mortality and adolescent birth rate in one direction, and GNI, expected years of schooling, and life expectancy at birth in the other. These look like general quality-of-life indicators, the kind of variables that separate developing countries from developed ones.

PC2 is defined by Parli.F and Labo.FM, which stand for the percentage of female representatives in parliament and the proportion of females in the labour force compared to that of men. PC1 covers the basics of how good the quality of life may be, and PC2 seems to be a higher-order measure of how developed a country is or how good life there may be, because the share of female representatives is usually high in highly developed countries, and the same goes for the proportion of females in the workforce.

Moving on to the Tea data.

Loading the tea dataset and converting its character variables to factors:

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

Exploring the data:

dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
View(tea)

Multiple Correspondence Analysis (MCA):

# leaving in only some of the variables
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, all_of(keep_columns))

mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA

# plot the variable categories (individuals hidden), coloured by variable
plot(mca, invisible = c("ind"), graph.type = "classic", habillage = "quali")

# plot the individuals (variable categories hidden)
plot(mca, invisible = c("var"), graph.type = "classic")

Some variable levels tend to occur together more often than others. For example, unpackaged tea is typically bought from tea shops, whereas tea bags are bought from chain stores. The levels for how the tea is drunk (alone/milk/lemon), whether sugar is added, and whether it is drunk at lunch lie closer together and are not separated as clearly. Of the tea types, green tea leans towards unpackaged tea from tea shops, whereas Earl Grey and black tea lie closer to chain stores and tea bags, as well as to the mixed categories (chain store+tea shop, tea bag+unpackaged).
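
The link between packaging and place of purchase read off the plot can also be checked with a plain cross-tabulation. This is a minimal sketch, assuming that `tea_time` is a subset of the `tea` data set shipped with FactoMineR; it reloads that data set so it runs on its own, and the object names `pack_by_shop` and `row_share` are only illustrative.

```r
library(FactoMineR)

# Assumption: tea_time was built from FactoMineR's tea data, so the two
# columns of interest are taken straight from that data set here
data(tea)

# Counts of packaging type (rows) by place of purchase (columns)
pack_by_shop <- table(packaging = tea$how, shop = tea$where)
print(pack_by_shop)

# Row proportions: within each packaging type, the share bought at each place
row_share <- prop.table(pack_by_shop, margin = 1)
print(round(row_share, 2))
```

If the reading of the plot is right, the "unpackaged" row should have most of its share under "tea shop" and the "tea bag" row under "chain store".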

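The "% of var." and "Cumulative % of var." rows in the eigenvalue table of the summary can be reproduced by hand, which shows how much of the total inertia the two plotted dimensions cover. The sketch below uses the rounded eigenvalues copied from the output above, so the results agree with the printed table only up to rounding. The names `eig`, `pct_var` and `cum_pct` are illustrative and not part of the MCA object.

```r
# Eigenvalues copied (rounded to 3 decimals) from the summary(mca) output
eig <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156,
         0.144, 0.141, 0.117, 0.087, 0.062)

# Share of total inertia per dimension, and the running total
pct_var <- 100 * eig / sum(eig)
cum_pct <- cumsum(pct_var)

# Side-by-side view of the three rows of the eigenvalue table
eig_table <- data.frame(
  dim        = paste0("Dim.", seq_along(eig)),
  eigenvalue = eig,
  pct_var    = round(pct_var, 2),
  cum_pct    = round(cum_pct, 2)
)
print(eig_table)

# In MCA the total inertia equals (number of categories / number of variables) - 1.
# With 6 variables and 11 dimensions there are 6 + 11 = 17 categories.
n_vars <- 6
n_categories <- n_vars + length(eig)
total_inertia <- n_categories / n_vars - 1
print(c(sum_of_eigenvalues = sum(eig), total_inertia = total_inertia))
```

The first two dimensions together hold a little under 30 % of the inertia, so the plots show only part of the structure in the data.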

(more chapters to be added similarly as we proceed with the course!)